Confidence in comparative genomics.

Author

  • Elliott H Margulies
Abstract

Comparative sequence analysis has become a widespread approach for identifying and characterizing functional elements encoded within genomic sequences. Marked by early successes (for review, see Hardison 2000), a tremendous amount of sequencing capacity has been, and continues to be, utilized for sequencing genomes of related species. Indeed, the choice of genomes for sequencing now has less to do with the biology or utility of a particular species as an experimental model organism than with its placement on the evolutionary tree of life. Optimal species are now characterized by an evolutionary distance (typically measured in neutral substitutions per site) that maximizes both sequence alignability and the ability to distinguish neutral DNA from sequences under evolutionary selection. This concept is exemplified in the mammalian species selected for low-redundancy whole-genome shotgun sequencing (Margulies et al. 2005; Green 2007) as well as the 12 fly genomes selected for comparative analyses (Drosophila 12 Genomes Consortium 2007; Stark et al. 2007). With the increased availability of all these species’ genomes, various algorithms have been developed to aid in the identification of sequences under purifying selection (Blanchette and Tompa 2002; Boffelli et al. 2003; Margulies et al. 2003; Cooper et al. 2005; Siepel et al. 2005, 2006), which is Nature’s way of pointing out sequences that have remained highly similar throughout evolution and have thus been “constrained” for some function, even when we don’t know what function that is. Equally interesting are methods to detect genomic sequences under positive selection (Clark et al. 2003; Nielsen et al. 2005; Pollard et al. 2006; Prabhakar et al. 2006; Kim and Pritchard 2007), which highlight rapidly evolving regions that have acquired new functions and might point to functional sequences that make species unique.
In addition, comparative sequencing efforts have had a major impact on studies of evolutionary biology, helping to resolve disputed evolutionary relationships and elucidate mechanisms by which evolution has occurred (e.g., Murphy et al. 2001; Nikolaev et al. 2007). Yet, with all these advances, there still remains a “single point of failure” in the field of comparative genomics: virtually all analyses rely on the generation of a pre-computed multi-sequence alignment. These alignments are typically generated by programs that use a number of computational shortcuts (such as a progressive alignment approach) to make the task of building genome-wide alignments feasible. While methods that combine the alignment task with other inferences have also been developed (Alexandersson et al. 2003), they are not widely used in large-scale studies because of their complexity and computational cost. Importantly, recent studies have shown that dramatic differences exist between multi-sequence alignments produced by different algorithms (Margulies et al. 2007; Prakash and Tompa 2007), despite the fact that these alignments are attempting to achieve similar goals from the exact same sequence datasets. The manuscript by Lunter and colleagues in this issue (Lunter et al. 2008) describes an interesting solution to the challenge of imperfect alignments. They first provide a thoughtful and systematic analysis of the three major classes of biases found in sequence alignments: (1) gap/edge wander, resulting from the incorrect placement of gaps due to spurious nonhomologous similarity; (2) gap attraction, resulting in the joining of two closely positioned gaps into one larger gap; and (3) gap annihilation, resulting in the deletion of two indels of equal size for a typically more favorable representation as substitutions. Examples of these three biases are nicely illustrated in Figure 1 of their manuscript (Lunter et al., this issue).
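Gap annihilation can be made concrete with a small scoring exercise. The Python sketch below is only an illustration: the sequences, the hypothetical ancestor, the scoring values (match +1, mismatch -1, gap open -4, gap extend -1), and the function name `score_alignment` are all assumptions and are not taken from Lunter et al. Two compensating 2-base deletions are applied to the assumed ancestor, and the historically correct alignment is scored against a gap-free alignment of the same two sequences.

```python
# Hypothetical ancestor: GATTACAGGCTA
# Sequence x lost "TT" (positions 3-4); sequence y lost "AG" (positions 7-8).

# Alignment that reflects the simulated history (two 2-base gaps).
TRUE_X = "GA--ACAGGCTA"
TRUE_Y = "GATTAC--GCTA"

# Gap-free alignment of the same sequences: both indels "annihilated"
# and re-expressed as substitutions.
GAPLESS_X = "GAACAGGCTA"
GAPLESS_Y = "GATTACGCTA"


def score_alignment(row_x, row_y, match=1, mismatch=-1,
                    gap_open=-4, gap_extend=-1):
    """Score a pairwise alignment given as two equal-length gapped rows.

    Each run of gaps in one row costs gap_open once, plus gap_extend
    for every gapped column (an assumed affine scheme).
    """
    if len(row_x) != len(row_y):
        raise ValueError("alignment rows must have equal length")
    score = 0
    prev_gap_row = None  # row that carried a gap in the previous column
    for a, b in zip(row_x, row_y):
        if a == "-" and b == "-":
            raise ValueError("column with gaps in both rows")
        if a == "-" or b == "-":
            gap_row = "x" if a == "-" else "y"
            if gap_row != prev_gap_row:
                score += gap_open  # a new gap run starts here
            score += gap_extend
            prev_gap_row = gap_row
        else:
            score += match if a == b else mismatch
            prev_gap_row = None
    return score


true_score = score_alignment(TRUE_X, TRUE_Y)
gapless_score = score_alignment(GAPLESS_X, GAPLESS_Y)
print("historically correct alignment:", true_score)
print("gap-free alignment:", gapless_score)
```

Under these assumed scores the gap-free alignment scores higher than the historically correct one, so a score-maximizing aligner would be expected to report substitutions where two indels actually occurred.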
From this initial analysis, they conclude that certain regions of pairwise alignments do not have a single theoretically correct solution. Even when the full evolutionary model is known, multiple evolutionarily plausible possibilities exist. Thus, we may never know with certainty the correct homology in certain regions of pairwise alignments. Their approach to overcoming this challenge is rather elegant and attacks the problem from a different perspective: Instead of trying to get the alignment correct (which they show might not be possible), they “flag” alignment columns that have a high probability of not being correct. While such a solution will not solve the challenges upstream of the alignment process (namely, identifying the correct orthologous sequences to align in the first place), their approach does help negate a major contributor to false-positive/negative results in downstream comparative sequence analyses. It is encouraging that their approach should also be amenable to multi-sequence alignments, since they are typically built up from a series of pairwise alignments. More than 15% of aligned bases are estimated to be incorrect in currently available whole-genome alignments between human and mouse (Lunter et al. 2008). While modest improvements were made on simulated alignments by more careful modeling of the evolutionary process (in particular, with respect to G + C content and distribution of indel lengths), the majority of alignment errors could not be resolved, reinforcing the need for a probabilistic approach in multi-sequence alignment analyses. These results led them to develop a posterior decoding algorithm that explicitly models uncertainties in inferred alignments. Alignment uncertainty is of particular concern in noncoding regions of mammalian genomes, which are notably difficult to align but also of great interest for identifying regulatory sequences. 
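The idea of attaching a probability of correctness to aligned positions can be sketched with a forward-backward computation over all pairwise alignments. The Python code below is a deliberately simplified stand-in, not the pair hidden Markov model of Lunter et al.: it assumes that alignment scores act as log-weights, uses a single linear gap penalty, and the function names `pair_posteriors` and `low_confidence_pairs`, the score values, and the threshold are all hypothetical.

```python
import math


def pair_posteriors(x, y, match=1.0, mismatch=-1.0, gap=-2.0):
    """Posterior probability that x[i] is aligned to y[j].

    Every alignment is weighted by exp(score), and the posterior of a
    pair is the weight of all alignments containing it divided by the
    total weight. Plain floats are used, so this is only suitable for
    short sequences; a real implementation would work in log space.
    """
    n, m = len(x), len(y)
    g = math.exp(gap)

    def w(a, b):
        return math.exp(match if a == b else mismatch)

    # fwd[i][j]: total weight of alignments of x[:i] with y[:j]
    fwd = [[0.0] * (m + 1) for _ in range(n + 1)]
    fwd[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                fwd[i][j] += fwd[i - 1][j - 1] * w(x[i - 1], y[j - 1])
            if i > 0:
                fwd[i][j] += fwd[i - 1][j] * g
            if j > 0:
                fwd[i][j] += fwd[i][j - 1] * g

    # bwd[i][j]: total weight of alignments of x[i:] with y[j:]
    bwd = [[0.0] * (m + 1) for _ in range(n + 1)]
    bwd[n][m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i < n and j < m:
                bwd[i][j] += bwd[i + 1][j + 1] * w(x[i], y[j])
            if i < n:
                bwd[i][j] += bwd[i + 1][j] * g
            if j < m:
                bwd[i][j] += bwd[i][j + 1] * g

    total = fwd[n][m]
    return [[fwd[i][j] * w(x[i], y[j]) * bwd[i + 1][j + 1] / total
             for j in range(m)] for i in range(n)]


def low_confidence_pairs(aligned_pairs, post, threshold=0.9):
    """Return the aligned (i, j) pairs whose posterior is below threshold."""
    return [(i, j) for i, j in aligned_pairs if post[i][j] < threshold]
```

Given the list of (i, j) pairs reported by any aligner, `low_confidence_pairs` flags those whose posterior falls below the chosen cutoff. In a repetitive stretch, where a gap can be placed in several nearly equivalent ways, one would expect the posterior mass to be spread over several candidate pairs rather than concentrated on one.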
With this new “probability of correctness” information that can be assigned to each column of a multi-sequence alignment, one can envision new approaches that incorporate confidence measures in myriad downstream comparative sequence analyses. In essence, we now know which parts of the alignment we can trust and which parts might be suspect, not because the alignment algorithm failed, but because there is no single highly probable result.

Corresponding author. E-mail [email protected]; fax (301) 480-3520.
Article is online at http://www.genome.org/cgi/doi/10.1101/gr.7228008.

Journal title:
  • Genome Research

Volume 18, Issue 2

Pages  -

Publication date: 2008